12/03/2019
Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.
In the following lectures, we will go through a relatively complete example of the process of data analysis, regression, diagnostics, and remediation.
Particularly, we will discuss the difficult-to-analyze question of systematic bias in insurance practices.
Insurance companies, by the nature of their business, need to price the cost of covering the risk of damages based on the probability of those damages.
However, we are also in a society that has historically enforced segregation, unequal rights and unequal access to services.
We will investigate a complicated question:
The term “insurance redlining” refers literally to drawing a red line on a map that excludes certain neighborhoods from services.
In the late 1970s, the US Commission on Civil Rights examined charges by several Chicago community organizations that insurance companies were redlining their neighborhoods.
Because comprehensive information about individuals being refused homeowners insurance was not available, the number of FAIR plan policies written and renewed in Chicago by zip code for the months of December 1977 through May 1978 was recorded.
The FAIR plan was offered by the city of Chicago as a default policy to homeowners who had been rejected by the voluntary market.
Information on other variables that might affect insurance writing such as fire and theft rates was also collected at the zip code level.
We note here that we do not actually know the ethnicity of the individuals who are denied insurance in this data set.
Rather, we only know the ethnic makeup of the zip code where an individual was denied.
This is an important difficulty that needs to be considered carefully before starting our analysis.
When data are collected at the group level, we may observe a correlation between two variables.
The ecological fallacy is concluding that the same correlation holds at the individual level.
For example, the ecological fallacy was discussed in a court challenge to the Washington gubernatorial election of 2004.
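To illustrate the fallacy concretely, we can build synthetic data (not the census figures discussed below) in which the individual-level association is positive while the association across group aggregates is negative:

```r
set.seed(1)
# Three groups: within each group y rises with x, but the group means
# are arranged so that the trend across aggregated means is reversed.
grp <- rep(1:3, each = 50)
m   <- c(3, 2, 1)[grp]                  # group centers
x   <- rnorm(150, mean = m)
y   <- 0.5 * x - 2 * m + rnorm(150, sd = 0.2)

# Individual level, adjusting for group: the slope of x is positive
slope_within <- coef(lm(y ~ x + factor(grp)))["x"]

# Aggregated to group means: the slope is negative
agg <- aggregate(cbind(x, y) ~ grp, FUN = mean)
slope_agg <- coef(lm(y ~ x, data = agg))["x"]

c(within = unname(slope_within), aggregate = unname(slope_agg))
```

Concluding from the aggregate slope that individuals with larger x have smaller y would be exactly the ecological fallacy.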
library("faraway")
head(eco)
usborn income home pop
Alabama 0.98656 21442 75.9 4040587
Alaska 0.93914 25675 34.0 550043
Arizona 0.90918 23060 34.2 3665228
Arkansas 0.98688 20346 67.1 2350725
California 0.74541 27503 46.4 29760021
Colorado 0.94688 28657 43.3 3294394
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(income ~ usborn, data=eco, xlab="Proportion US born", ylab="Mean Annual Income", col=1, cex=3, cex.lab=3, cex.axis=1.5)
lmod <- lm(income ~ usborn, eco)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(income ~ usborn, data=eco, xlab="Proportion US born", ylab="Mean Annual Income",xlim=c(0,1),ylim=c(15000,70000),xaxs="i", col=1, cex=3, cex.lab=3, cex.axis=1.5)
abline(coef(lmod))
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68642.2 8739.0 7.8547 3.188e-10
usborn -46018.6 9279.1 -4.9594 8.891e-06
n = 51, p = 2, Residual SE = 3489.54138, R-Squared = 0.33
Q: can we conclude, therefore, that naturalized citizens earn more than US born citizens in general?
A: no, this is precisely the ecological fallacy.
From the US Census Bureau, we know that generally US born citizens earn slightly more than their naturalized peers.
It is unreasonable to conclude that naturalized citizens have higher income in general just because there is a correlation at the state demographic level between more naturalized citizens and higher per-capita income.
In our case, with the Chicago insurance data, we have to note likewise the danger of the ecological fallacy.
Particularly, we only have aggregate data on the zip code level describing the ethnic composition of different neighborhoods, and the number of FAIR plan policies per neighborhood.
Even if there is a strong correlation between neighborhoods with a high proportion of minority residents and a higher number of FAIR plans, we cannot conclude from that alone that individual minority applicants were denied insurance.
Dealing with this subtlety will be a central challenge of this dataset in terms of drawing conclusions.
We start by performing some exploratory analysis:
head(chredlin, n=4)
race fire theft age involact income side
60626 10.0 6.2 29 60.4 0.0 11.744 n
60640 22.2 9.5 44 76.5 0.1 9.323 n
60613 19.6 10.5 36 73.5 1.2 9.948 n
60657 17.3 7.7 37 66.9 0.5 10.656 n
summary(chredlin)
race fire theft age
Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00
1st Qu.: 3.75 1st Qu.: 5.65 1st Qu.: 22.00 1st Qu.:48.60
Median :24.50 Median :10.40 Median : 29.00 Median :65.00
Mean :34.99 Mean :12.28 Mean : 32.36 Mean :60.33
3rd Qu.:57.65 3rd Qu.:16.05 3rd Qu.: 38.00 3rd Qu.:77.30
Max. :99.70 Max. :39.70 Max. :147.00 Max. :90.10
involact income side
Min. :0.0000 Min. : 5.583 n:25
1st Qu.:0.0000 1st Qu.: 8.447 s:22
Median :0.4000 Median :10.694
Mean :0.6149 Mean :10.696
3rd Qu.:0.9000 3rd Qu.:11.989
Max. :2.2000 Max. :21.480
We see that there is a wide range of values for “race” and that the distribution is skewed towards zero rather than 100.
The third quartile is just over \( 57\% \), indicating that about a quarter of the zip codes in Chicago are majority-minority neighborhoods.
This is a useful fact for our analysis, because if all neighborhoods were homogeneous (approximately equal percentages for all zip codes) we wouldn't be able to distinguish any differences from our aggregate data.
We see likewise that theft and income are skewed variables.
Also, the number of FAIR plans has many zeros, including the entire first quartile.
This is potentially an issue for our linear model assumptions that we have used so far…
We will perform some quick exploratory plots to examine relationships among variables.
require(ggplot2)
ggplot(chredlin,aes(race,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
ggplot(chredlin,aes(fire,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
ggplot(chredlin,aes(theft,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
With theft, there appears to be a high-leverage point that is influencing the fitted linear relationship between FAIR coverage and theft.
ggplot(chredlin,aes(age,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
With age, there appears to be a linear relationship: older housing is associated with a higher number of FAIR plans.
ggplot(chredlin,aes(income,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
With income, there appears to be some relationship between higher income and a lower number of FAIR plans, but outliers and nonlinearity appear to affect the relationship.
Particularly, the high income observation appears to have high leverage, and to produce unphysical (negative) values for the relationship.
ggplot(chredlin,aes(side,involact)) + geom_point(position = position_jitter(width = .2,height=0)) + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
sumary(lm(involact ~ race,chredlin))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1292180 0.0966106 1.3375 0.1878
race 0.0138824 0.0020307 6.8361 1.784e-08
n = 47, p = 2, Residual SE = 0.44883, R-Squared = 0.51
There is a non-zero parameter with a vanishingly small p-value: we have almost no doubt that there is some statistical relationship between neighborhoods with high minority composition and a higher number of FAIR plans.
However, we have to note at this moment that there may be a number of confounding variables that provide a better explanation.
It could be legitimate business practice on the part of insurance companies to deny normal insurance when a neighborhood carries too high a risk;
ggplot(chredlin,aes(race,fire)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
ggplot(chredlin,aes(race,theft)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
When plotting theft versus race, we see a trend of slightly higher rate of theft with neighborhoods of higher minority concentration.
We note, also, this plot has an extreme outlier that may be important for the analysis – this might be skewing the trend positively in a way that isn't consistent with the overall behaviour of the city.
We move to the question of model selection, to determine the best possible predictors for the number of FAIR plans in a neighborhood.
However, this is fundamentally an explanatory model in which we want to quantify the effect of the ethnic composition of a neighborhood on the number of plans.
With this purpose, we will thus emphasize techniques which help analyze the explanatory power of the model.
This also indicates an important consideration, where we need to control for certain variables in the model.
In this case, we will belabor some of the model selection (and the sensitivity of the model parameters) so that we can make conclusions and explanations with greater confidence.
One change that we will make is to use the log of income, as is often done, due to the skewness of the variable; this is also because money operates in society on a practically logarithmic scale.
library("leaps")
sum_life <- summary(regsubsets(involact~ race + fire + theft + age + log(income) + side, data=chredlin))
sum_life
Subset selection object
Call: regsubsets.formula(involact ~ race + fire + theft + age + log(income) +
side, data = chredlin)
6 Variables (and intercept)
Forced in Forced out
race FALSE FALSE
fire FALSE FALSE
theft FALSE FALSE
age FALSE FALSE
log(income) FALSE FALSE
sides FALSE FALSE
1 subsets of each size up to 6
Selection Algorithm: exhaustive
race fire theft age log(income) sides
1 ( 1 ) "*" " " " " " " " " " "
2 ( 1 ) "*" "*" " " " " " " " "
3 ( 1 ) "*" "*" "*" " " " " " "
4 ( 1 ) "*" "*" "*" "*" " " " "
5 ( 1 ) "*" "*" "*" "*" "*" " "
6 ( 1 ) "*" "*" "*" "*" "*" "*"
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$bic, xlab = "Number explanatory variables", ylab = "Bayesian Information Criterion", col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$adjr2, xlab = "Number explanatory variables", ylab = "Adjusted R squared", col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$cp, xlab = "Number explanatory variables", ylab = "Mallows Cp", col=1, cex=3, cex.lab=3, cex.axis=1.5)
abline(0,1)
lmod <- lm(involact ~ race + fire + theft + age + log(income), data=chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.1855396 1.1002549 -1.0775 0.2875500
race 0.0095022 0.0024896 3.8168 0.0004485
fire 0.0398560 0.0087661 4.5466 4.758e-05
theft -0.0102945 0.0028179 -3.6533 0.0007276
age 0.0083356 0.0027440 3.0377 0.0041345
log(income) 0.3457615 0.4001234 0.8641 0.3925401
n = 47, p = 6, Residual SE = 0.33453, R-Squared = 0.75
ggplot(chredlin,aes(race,income)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
cor(chredlin$income,chredlin$race)
[1] -0.7037328
The strong anti-correlation of log-income and race makes the variable selection take on a political dimension.
The best-case scenario is to perform the model analysis both with and without log-income and compare the results.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,1:1, col=1, cex=3, cex.lab=3, cex.axis=1.5)
lmod$fitted[lmod$fitted < 0 ]
60611 60656 60638 60652 60655
-0.05706070 -0.23408790 -0.05930948 -0.28650119 -0.15827619
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,1:1, col=1, cex=3, cex.lab=3, cex.axis=1.5)
These small negative numbers are problematic, due in part to the high number of zero values for the FAIR plans;
Unfortunately, a realistic scale change won't fix this: a square root transform makes a nominal difference, but not enough to justify the loss of interpretability of the response.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,2, col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,3, col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,5, col=1, cex=3, cex.lab=3, cex.axis=1.5)
To verify the outlier analysis numerically, we can perform the Bonferroni-corrected \( 5\% \) significance test on the Studentized residuals.
Remember, the critical value is the \( \frac{\alpha}{2n} \) quantile of the t distribution with \( n-p-1 \) degrees of freedom:
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.1855396 1.1002549 -1.0775 0.2875500
race 0.0095022 0.0024896 3.8168 0.0004485
fire 0.0398560 0.0087661 4.5466 4.758e-05
theft -0.0102945 0.0028179 -3.6533 0.0007276
age 0.0083356 0.0027440 3.0377 0.0041345
log(income) 0.3457615 0.4001234 0.8641 0.3925401
n = 47, p = 6, Residual SE = 0.33453, R-Squared = 0.75
stud <- rstudent(lmod)
alpha_crit <- qt(0.05/(2 * 47), 40)
abs(stud) > abs(alpha_crit)
60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
60646 60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
60623 60608 60616 60632 60609 60653 60615 60638 60629 60636 60621 60637
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
60652 60620 60619 60649 60617 60655 60643 60628 60627 60633 60645
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We note that there is a structural issue arising due to the high number of zero FAIR plans (the response variable), but that there isn't a realistic choice of scale transformation to fix this.
Normality and variance assumptions of the error are otherwise OK.
We will probably need to evaluate the effect of points of high leverage in the analysis.
We wish thus to study the sensitivity of the main parameter of interest “race”, with respect to the leverage points and controlling for covariates in the analysis.
We will begin by looking at partial residual plots, which are used for nonlinearity detection when factoring out for covariates.
We will also look at the confidence intervals for the parameters.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=1, cex=3, cex.lab=3, cex.axis=1.5)
coefficients(lmod)[2]
race
0.009502223
confint(lmod, parm = "race")
2.5 % 97.5 %
race 0.004474458 0.01452999
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=2, cex=3, cex.lab=3, cex.axis=1.5)
coefficients(lmod)[3]
fire
0.03985604
confint(lmod, parm = "fire")
2.5 % 97.5 %
fire 0.02215246 0.05755963
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=3, cex=3, cex.lab=3, cex.axis=1.5)
coefficients(lmod)[4]
theft
-0.01029451
confint(lmod, parm = "theft")
2.5 % 97.5 %
theft -0.01598536 -0.004603655
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=4, cex=3, cex.lab=3, cex.axis=1.5)
coefficients(lmod)[5]
age
0.0083356
confint(lmod, parm = "age")
2.5 % 97.5 %
age 0.002793968 0.01387723
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=5, cex=3, cex.lab=3, cex.axis=1.5)
coefficients(lmod)[6]
log(income)
0.3457615
confint(lmod, parm = "log(income)")
2.5 % 97.5 %
log(income) -0.4623041 1.153827
We note, however, that while the confidence interval for theft is narrow, the analysis may be strongly affected by the high-leverage observation.
This is something we will consider later in the analysis.
With the generally good diagnostics, in terms of Gaussianity and constant variance of the errors (with some qualifications) we might make some conclusions.
Loosely speaking, it appears that there is a small, but statistically significant, effect in which neighborhoods with higher concentrations of ethnic minorities have a higher number of FAIR plans, when all other variables are held equal.
This effect is not nearly as strong as e.g., the rate of fires with all other variables held equal.
Our analysis is not nearly complete, and we cannot say whether any applicant for insurance was rejected based upon their ethnic background;
In terms of analyzing the sensitivity of the race parameter with respect to the inclusion or exclusion of other covariates, we can automate this to determine if the effect on the response changes dramatically.
This is similar to the case when we used matching in the voting districts in New Hampshire, where we needed to see if an additional covariate was acting as a confounding variable on our analysis.
By analyzing the inter-model stability of this parameter, we get a better sense of whether this explanation is actually robust.
beta pvalue
race 0.0139 0.0000
race + fire 0.0089 0.0002
race + theft 0.0141 0.0000
race + age 0.0123 0.0000
race + log ( income ) 0.0082 0.0087
race + fire + theft 0.0082 0.0002
race + fire + age 0.0089 0.0001
race + fire + log ( income ) 0.0070 0.0160
race + theft + age 0.0128 0.0000
race + theft + log ( income ) 0.0084 0.0083
race + age + log ( income ) 0.0099 0.0017
race + fire + theft + age 0.0081 0.0001
race + fire + theft + log ( income ) 0.0073 0.0078
race + fire + age + log ( income ) 0.0085 0.0041
race + theft + age + log ( income ) 0.0106 0.0010
race + fire + theft + age + log ( income ) 0.0095 0.0004
The above output shows the parameter for race and the associated p-values for all 16 models.
There is some variance in the magnitude of the effect, depending on the other variables we control for, but in no case does the p-value rise above \( 5\% \) or does the parameter change sign.
While this suggests some uncertainty over the magnitude of the effect, we can be confident that the significance of the effect is not sensitive to the choice of adjusters.
Had we instead found models in which the minority composition of the neighborhood was not statistically significant, controlling for other factors would have been more difficult.
We would then need to consider more deeply which covariates should be adjusted for and which not.
This level of model analysis and selection is at the heart of many real world problems, and we are fortunate as modellers in this case that it is unnecessary.
In general, however, this is the reality of real problems you will encounter after this class.
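The 16-row table of race coefficients and p-values shown above can be produced with a short loop over covariate subsets; a sketch (the helper name “beta_table” is ours, not from the faraway package):

```r
# For each subset of the candidate covariates, fit a model containing the
# focal predictor plus that subset, and record the focal coefficient and
# its p-value.
beta_table <- function(data, response, focal, covars) {
  # all subsets of the candidate covariates, including the empty set
  subsets <- list(character(0))
  for (k in seq_along(covars))
    subsets <- c(subsets, combn(covars, k, simplify = FALSE))
  out <- t(sapply(subsets, function(s) {
    cf <- summary(lm(reformulate(c(focal, s), response), data = data))$coefficients
    c(beta = cf[focal, "Estimate"], pvalue = cf[focal, "Pr(>|t|)"])
  }))
  rownames(out) <- sapply(subsets,
                          function(s) paste(c(focal, s), collapse = " + "))
  round(out, 4)
}
```

Calling beta_table(chredlin, "involact", "race", c("fire", "theft", "age", "log(income)")) should reproduce the table above, up to rounding.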
Now, we want to analyze the effect of leverage and influence on the model and our interpretation.
We recall that there are several influential points, and we wish to determine if their inclusion or exclusion would radically change the interpretation for the parameter for race.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(dfbeta(lmod)[,2],ylab="Change in race coef", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)
diags <- data.frame(lm.influence(lmod)$coef)
ggplot(diags,aes(row.names(diags), race)) + geom_linerange(aes(ymax=0, ymin=race)) + theme(axis.text.x=element_text(angle=90), axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold")) + xlab("ZIP") + geom_hline(yintercept = 0)
ggplot(diags,aes(x=fire,y=theft))+geom_text(label=row.names(diags)) +
theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
chredlin[c("60607","60610"),]
race fire theft age involact income side
60607 50.2 39.7 147 83.0 0.9 7.459 n
60610 54.0 34.1 68 52.6 0.3 8.231 n
lmode <- lm(involact ~ race + fire + theft + age + log(income), chredlin,subset=-c(6,24))
sumary(lmode)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5767365 1.0800462 -0.5340 0.59638
race 0.0070527 0.0026960 2.6160 0.01259
fire 0.0496474 0.0085701 5.7931 1.004e-06
theft -0.0064336 0.0043489 -1.4794 0.14708
age 0.0051709 0.0028947 1.7863 0.08182
log(income) 0.1157028 0.4011132 0.2885 0.77453
n = 45, p = 6, Residual SE = 0.30320, R-Squared = 0.8
We have verified that our conclusions are also robust to the exclusion of one or perhaps two cases from the data.
In any case, it would be very important to disclose this choice in the analysis, and any suppression of the information would be extremely dishonest.
x <- model.matrix(lmod)
x0 <- apply(x,2,median)
x0 <- data.frame(t(x0))
# note: predict() won't accept the transformed column name ("log(income)"),
# so we rename it back to the original variable name...
colnames(x0)[6] <- "income"
predict(lmod,new=x0,interval="prediction")
fit lwr upr
1 0.003348934 -1.398822 1.40552
predict(lmod,new=x0,interval="confidence")
fit lwr upr
1 0.003348934 -1.225333 1.232031
We have shown robust evidence that the neighborhood's ethnic composition, with all other variables held equal, has a statistically significant effect on the number of FAIR plans in the neighborhood.
This doesn't mean conclusively, however, that there was systematic discrimination.
If we tried another hypothetical model, suppressing theft and the two high-influence observations:
modalt <- lm(involact ~ race+fire+log(income),chredlin, subset=-c(6,24))
sumary(modalt)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7532557 0.8358798 0.9012 0.37277
race 0.0042061 0.0022757 1.8483 0.07178
fire 0.0510220 0.0084504 6.0378 3.822e-07
log(income) -0.3623825 0.3191620 -1.1354 0.26279
n = 45, p = 4, Residual SE = 0.30919, R-Squared = 0.79
the parameter for race is no longer significant.
This is just a hypothetical example, because our relatively exhaustive analysis has shown that this model is cherry-picked relative to the other possible models.
Generally, however, this illustrates some of the subtlety involved in statistical analyses like this one.
Part of robust analysis is thus to perform several model selections and the associated diagnostics to deal with the overall uncertainty in the model selection;
The strength of our conclusions will increase if we can show that several different models all have similar interpretations.
If there are several reasonable models with different interpretations or conclusions, there is a fundamental issue in the data, and we should be cautious and skeptical about whether there is really a signal to be found.
Generally, we have performed rigorous analysis in the above (pending completion of additional analysis).
However, there are still several ways we should be cautious about our conclusions:
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "s"),chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9946613 1.6949252 -1.1768 0.256470
race 0.0093509 0.0046055 2.0304 0.059285
fire 0.0510812 0.0170314 2.9992 0.008493
theft -0.0102565 0.0090972 -1.1274 0.276184
age 0.0067778 0.0053061 1.2774 0.219700
log(income) 0.6810543 0.6493336 1.0489 0.309832
n = 22, p = 6, Residual SE = 0.34981, R-Squared = 0.76
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "n"),chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7902939 1.7793768 -0.4441 0.66196
race 0.0133200 0.0053903 2.4711 0.02310
fire 0.0246654 0.0154219 1.5994 0.12623
theft -0.0079383 0.0039815 -1.9938 0.06073
age 0.0090040 0.0046471 1.9375 0.06769
log(income) 0.1666853 0.6233641 0.2674 0.79204
n = 25, p = 6, Residual SE = 0.35115, R-Squared = 0.76
Generally, by subdividing data into smaller and smaller groups, we can dilute the significance.
However, as noted before, aggregating data too much causes its own issues – the balance between these is contextual and subjective based on our understanding of the task at hand.
ggplot(chredlin,aes(side,race)) + geom_point(position = position_jitter(width = .2,height=0)) + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
summary(chredlin$race[chredlin$side=="n"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.80 10.00 21.95 24.50 94.40
summary(chredlin$race[chredlin$side=="s"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 34.27 48.10 49.80 72.92 99.70
We performed initial, exploratory analysis, checking for structure in the data, and in the plots of variables versus the response.
We fit a model based on a mix of criterion methods and some qualitative judgement.
We performed diagnostics to see how our assumptions might be breaking down – aside from an unavoidable issue with the many zero observations for the FAIR plan, it was a well functioning model.
We evaluated the partial residuals, and confidence intervals for each of the parameters in our model.
As this is an explanatory model, we were interested in “predicting” the parameter for “race”;
We also demonstrated robustness to the exclusion of observations, and checked formally for outliers that could affect the fit.
Just for fun, we saw that it has awful predictive power…
We made a final assessment of the uncertainty that we could not account for with our data.
After all this analysis, everything we say needs to be hedged with “ifs” and “buts” and qualified carefully.
This is the reality of statistical modeling, where causal conclusions are extremely difficult to draw.
To quote Faraway, quoting Winston Churchill:
“Indeed, it has been said that Democracy is the worst form of government, except all those other forms that have been tried from time to time.”
We might say the same about statistics with respect to how it helps us reason in the face of uncertainty.
Missing data causes both technical data processing and statistical challenges in regression analysis.
By creating a framework for different kinds of missing data, we have some mathematical tools for understanding how it will affect our analysis.
Certain kinds of missing data are harder than others to treat mathematically – at times, we won't have good tools for filling missing data values, or treating the implicit bias therein.
However, certain kinds of missing data can be modeled statistically, along with the uncertainty when treating these cases.
Here is a summary of the different missing-data regimes we might find ourselves in.
The following types of missing values are distinguished:
MCAR (missing completely at random): the probability that a value is missing does not depend on any data values.
MAR (missing at random): the probability that a value is missing depends only on observed values.
MNAR (missing not at random): the probability that a value is missing depends on the unobserved values themselves.
Amongst the above, when the data is MAR, we can adjust based on observed variables and therefore handle missingness and bias mathematically – we will focus on this situation.
If we wish to understand methods of treating missing data, we can take a data set and delete values to compare how conclusions might be affected by our treatment.
Prototypically, we will study the Chicago Insurance data once again, but with values missing at random.
library("faraway")
summary(chmiss)
race fire theft age
Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00
1st Qu.: 3.75 1st Qu.: 5.60 1st Qu.: 22.00 1st Qu.:48.30
Median :24.50 Median : 9.50 Median : 29.00 Median :64.40
Mean :35.61 Mean :11.42 Mean : 32.65 Mean :59.97
3rd Qu.:57.65 3rd Qu.:15.10 3rd Qu.: 38.00 3rd Qu.:78.25
Max. :99.70 Max. :36.20 Max. :147.00 Max. :90.10
NA's :4 NA's :2 NA's :4 NA's :5
involact income
Min. :0.0000 Min. : 5.583
1st Qu.:0.0000 1st Qu.: 8.564
Median :0.5000 Median :10.694
Mean :0.6477 Mean :10.736
3rd Qu.:0.9250 3rd Qu.:12.102
Max. :2.2000 Max. :21.480
NA's :3 NA's :2
rowSums(is.na(chmiss))
60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631
1 0 1 1 0 0 0 0 1 1 0 0
60646 60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607
1 0 0 1 0 0 0 1 1 0 0 1
60623 60608 60616 60632 60609 60653 60615 60638 60629 60636 60621 60637
0 1 1 0 1 0 0 0 1 0 1 0
60652 60620 60619 60649 60617 60655 60643 60628 60627 60633 60645
0 1 0 1 1 0 0 1 0 0 1
Here there is at most one value missing from each row; likewise, the missing values are spread fairly evenly through the data.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
image(is.na(chmiss),axes=FALSE,col=gray(1:0))
axis(2, at=0:5/5, labels=colnames(chmiss), cex=3, cex.lab=3, cex.axis=1.5)
axis(1, at=0:46/46, labels=row.names(chmiss),las=2, cex=3, cex.lab=3, cex.axis=1.5)
Suppose we favor a deletion approach to all samples with missing values.
We will compare the full model with all values versus the situation in which we delete missing values:
modfull <- lm(involact ~ . - side, chredlin)
sumary(modfull)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6089790 0.4952601 -1.2296 0.2258512
race 0.0091325 0.0023158 3.9435 0.0003067
fire 0.0388166 0.0084355 4.6015 4e-05
theft -0.0102976 0.0028529 -3.6096 0.0008269
age 0.0082707 0.0027815 2.9734 0.0049143
income 0.0245001 0.0316965 0.7730 0.4439816
n = 47, p = 6, Residual SE = 0.33513, R-Squared = 0.75
modmiss <- lm(involact ~ ., chmiss)
sumary(modmiss)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.1164827 0.6057615 -1.8431 0.0794750
race 0.0104867 0.0031283 3.3522 0.0030180
fire 0.0438757 0.0103190 4.2519 0.0003557
theft -0.0172198 0.0059005 -2.9184 0.0082154
age 0.0093766 0.0034940 2.6837 0.0139041
income 0.0687006 0.0421558 1.6297 0.1180775
n = 27, p = 6, Residual SE = 0.33822, R-Squared = 0.79
We may thus consider “filling in” the missing values for various samples – this is known as data imputation.
One approach is to fill in each missing value with something “representative” of the observed population, such as the column mean:
(cmeans <- colMeans(chmiss,na.rm=TRUE))
race fire theft age involact income
35.6093023 11.4244444 32.6511628 59.9690476 0.6477273 10.7358667
mchm <- chmiss
for(i in c(1:4,6)) mchm[is.na(chmiss[,i]),i] <- cmeans[i]
imod <- lm(involact ~ ., mchm)
sumary(imod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0708021 0.5094531 0.1390 0.890203
race 0.0071173 0.0027057 2.6305 0.012245
fire 0.0287418 0.0093855 3.0624 0.004021
theft -0.0030590 0.0027457 -1.1141 0.272242
age 0.0060795 0.0032079 1.8952 0.065695
income -0.0270917 0.0316782 -0.8552 0.397791
n = 44, p = 6, Residual SE = 0.38412, R-Squared = 0.68
In this case, there isn't just loss of fit, but also qualitative differences in the parameters.
Particularly, theft and age have lost significance in this model versus the iterations previously seen.
Also, the parameters themselves are smaller in magnitude, describing less “effect” in the signal overall.
In this case, we see significant bias induced by the imputation (toward the mean values);
In the case that there is a small number of missing values for a categorical variable, we can typically model the “missing-value” as a category of its own.
We can consider a more sophisticated approach for handling missing values using regression.
Particularly, if the variables are strongly correlated (or anti-correlated) with each other, there is information in the explanatory variables about how they vary together (or oppositely).
Recall that for percentage variables, it is sometimes useful to change scale to the whole real line when such a variable is the response.
The logit transformation is given as, \( y \mapsto \log\left(\frac{y}{1 - y}\right) \)
The “logit” and “ilogit” (inverse) maps are available in the faraway package.
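As a quick sanity check of the transform, we can write the pair by hand (rather than relying on the faraway versions) and verify the round trip:

```r
# logit maps (0,1) to the real line; ilogit inverts it
logit  <- function(p) log(p / (1 - p))
ilogit <- function(x) 1 / (1 + exp(-x))

p <- c(0.01, 0.5, 0.99)
logit(p)          # symmetric about 0; logit(0.5) is exactly 0
ilogit(logit(p))  # round trip recovers p
```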
We will model the percent ethnic minority in a zip code as regressed on the other variables with the missing cases removed:
lmodr <- lm(logit(race/100) ~ fire+theft+age+income,chmiss)
ilogit(predict(lmodr,chmiss[is.na(chmiss$race),]))*100
60646 60651 60616 60617
0.4190909 14.7320193 84.2653995 21.3121261
chredlin$race[is.na(chmiss$race)]
[1] 1.0 13.4 62.3 36.4
Two of the predictions are reasonable, but two are significantly off.
This can be done for each of the explanatory variables, and while preferable to mean imputation, it still leads to bias.
If the problem with the earlier approaches is the loss of population variation, we can try to restore some variance in the imputation mathematically.
We will create 25 different versions of the data, with each missing value “re-sampled” back with uncertainty.
The known values will be the same, but we will have “m” different versions of the missing values, drawn from a “Bayesian posterior” estimate…
library(Amelia)
set.seed(123)
chimp <- amelia(chmiss, m=25)
-- Imputation 1 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34
(EM iteration-count output for imputations 2 through 25 omitted; each converged within 7 to 143 iterations)
str(chimp, 1)
List of 12
$ imputations:List of 25
..- attr(*, "class")= chr [1:2] "mi" "list"
$ m : num 25
$ missMatrix : logi [1:47, 1:6] FALSE FALSE FALSE FALSE FALSE FALSE ...
..- attr(*, "dimnames")=List of 2
$ overvalues : NULL
$ theta : num [1:7, 1:7, 1:25] -1 0.01738 0.00786 -0.06694 -0.00626 ...
$ mu : num [1:6, 1:25] 0.01738 0.00786 -0.06694 -0.00626 -0.10637 ...
$ covMatrices: num [1:6, 1:6, 1:25] 0.977 -0.517 0.609 0.532 0.506 ...
$ code : num 1
$ message : chr "Normal EM convergence."
$ iterHist :List of 25
$ arguments :List of 23
..- attr(*, "class")= chr [1:2] "ameliaArgs" "list"
$ orig.vars : chr [1:6] "race" "fire" "theft" "age" ...
- attr(*, "class")= chr "amelia"
str(chimp$imputations, 1)
List of 25
$ imp1 :'data.frame': 47 obs. of 6 variables:
$ imp2 :'data.frame': 47 obs. of 6 variables:
$ imp3 :'data.frame': 47 obs. of 6 variables:
$ imp4 :'data.frame': 47 obs. of 6 variables:
$ imp5 :'data.frame': 47 obs. of 6 variables:
$ imp6 :'data.frame': 47 obs. of 6 variables:
$ imp7 :'data.frame': 47 obs. of 6 variables:
$ imp8 :'data.frame': 47 obs. of 6 variables:
$ imp9 :'data.frame': 47 obs. of 6 variables:
$ imp10:'data.frame': 47 obs. of 6 variables:
$ imp11:'data.frame': 47 obs. of 6 variables:
$ imp12:'data.frame': 47 obs. of 6 variables:
$ imp13:'data.frame': 47 obs. of 6 variables:
$ imp14:'data.frame': 47 obs. of 6 variables:
$ imp15:'data.frame': 47 obs. of 6 variables:
$ imp16:'data.frame': 47 obs. of 6 variables:
$ imp17:'data.frame': 47 obs. of 6 variables:
$ imp18:'data.frame': 47 obs. of 6 variables:
$ imp19:'data.frame': 47 obs. of 6 variables:
$ imp20:'data.frame': 47 obs. of 6 variables:
$ imp21:'data.frame': 47 obs. of 6 variables:
$ imp22:'data.frame': 47 obs. of 6 variables:
$ imp23:'data.frame': 47 obs. of 6 variables:
$ imp24:'data.frame': 47 obs. of 6 variables:
$ imp25:'data.frame': 47 obs. of 6 variables:
- attr(*, "class")= chr [1:2] "mi" "list"
We will fit a model to each of the versions and combine them into a single fit by averaging over all of the perturbed datasets.
The overall standard error can then be estimated from the within-version standard errors together with the variance of the parameters \( \beta \) across the different versions of the imputed data.
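This averaging is the standard Rubin combination for multiple imputation (which the mi.meld call below implements); in our notation, with estimates \( \hat{\beta}^{(i)} \) and within-version variances \( \hat{U}^{(i)} \) from version \( i = 1, \dots, m \):

\[
\bar{\beta} = \frac{1}{m}\sum_{i=1}^{m} \hat{\beta}^{(i)}, \qquad
T = \frac{1}{m}\sum_{i=1}^{m} \hat{U}^{(i)} \;+\; \left(1 + \frac{1}{m}\right)\frac{1}{m-1}\sum_{i=1}^{m} \left(\hat{\beta}^{(i)} - \bar{\beta}\right)^2,
\]

and the reported standard error is \( \sqrt{T} \): the average within-version variance plus an inflated between-version variance.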
betas <- NULL
ses <- NULL
for(i in 1:chimp$m){
  lmod <- lm(involact ~ race + fire + theft + age, chimp$imputations[[i]])
  betas <- rbind(betas, coef(lmod))
  ses <- rbind(ses, coef(summary(lmod))[,2])
}
(cr <- mi.meld(q=betas,se=ses))
$q.mi
(Intercept) race fire theft age
[1,] -0.2378402 0.007730138 0.03546648 -0.00873242 0.007218719
$se.mi
(Intercept) race fire theft age
[1,] 0.161498 0.002026873 0.008519401 0.004555294 0.002531306
cr$q.mi/cr$se.mi
(Intercept) race fire theft age
[1,] -1.472713 3.813825 4.163025 -1.916983 2.851776
Here the results are fairly similar to the earlier fits, though in this case theft is no longer a significant parameter.
Many more techniques exist for handling missing data; this has been only an introduction, and additional references can be found in Faraway.